FINAL REPORT

“Robust Functionality and Active Data Management
for Cooperative Networks in the Presence of WMD Stressors”

Majeed  M.  Hayat1,  Patrick  G.  Bridges1,  Yasamin  Mostofi1,  and  Patricia  Crowley2 

In this project, we have begun the development of a rigorous probabilistic framework that has
enabled new understanding and control of distributed networks' vulnerabilities to WMD-induced
failures through a quantitative assessment of basic network-functionality metrics. We have developed
a general stochastic queuing model and performed basic analysis to characterize the statistics
of the task completion time treated as a random variable. Using the developed framework, we
have established a theoretical/computational optimization tool that maximizes a network's robustness
to node/link failures by constantly redistributing the network's computational loads. The
framework also includes a rigorous treatment for describing the time evolution of the functionality
of any network subject to multiple random-in-space-and-time disruption and restoration events.
Next, we have developed a theory to describe and solve the problem of a network attempting
to reach consensus on the occurrence of a WMD attack based upon network data. To this end,
we have developed the mathematical foundations of networked binary consensus over noisy links.
Finally, we have considered real-time adaptation of computation and storage in distributed, unreliable
(including sensor) networks in the face of significant, potentially correlated failures induced by
WMD stressors; we have established methods for low-overhead monitoring systems that can drive
decentralized adaptation.

In what follows, we describe the details of the basic research accomplished in this project as
summarized above. We divide the description into four categories: (i) Probabilistic Network Modeling
and Resilient Task Reallocation, (ii) Probabilistic Network Modeling of Network Functionality,
(iii) Network Health Assessment, and (iv) Active Data Management. Participants in these activities,
including students, are listed in the respective sections below.

1. Probabilistic Network Modeling and Resilient Task Reallocation
Executive  Summary 

A rigorous probabilistic framework to analytically characterize the execution time of workloads
in distributed computing systems (DCSs) subject to stochastic topological changes due to WMD
attacks was developed as a result of this effort. The developed characterization considered a group of
heterogeneous and geographically dispersed computing nodes, uncertainties in the communication
network due to random non-negligible delays, and stochastic long-term node failures due to WMD
attacks. The metric employed to assess the performance of the DCS is the service reliability, which
was defined as the probability of executing a workload before all the computing nodes fail. Resilient
task reallocation policies were obtained by solving a constrained optimization problem whose cost
function employs the rigorous model developed for the service reliability of workloads. An algorithm
that scales linearly with the number of nodes was also derived to reduce the computational complexity
of the optimization problem. The mathematical model was validated using Monte-Carlo (MC)
simulations and experimental data collected from a testbed DCS.

1 The University of New Mexico, Albuquerque, NM

2 Gonzaga University, Spokane, WA
Design, Optimization and Management of Heterogeneous Networked Systems (DOM-HetNetS
• M. M. Hayat, J. E. Pezoa, D. Dietz, and S. Dhakal, “Dynamic load balancing for robust
distributed computing in the presence of topological impairments,” Wiley Handbook of Science
and Technology for Homeland Security, 2009.

Technical  Summary 

Introduction: DCSs offer a flexible, reliable, and powerful cooperative computing platform. When
DCSs operate in harsh or threat-prone environments, factors such as limited or intermittent communication
resources or long-term physical damage of the computing nodes can result in random
topological changes in the DCS, which, in turn, can severely degrade the performance and reliability
of DCSs. It is therefore necessary to develop control strategies that increase the robustness
of networks when a threat is present. This report describes how intelligent task reallocation
strategies, and their stochastic mathematical models, can be exploited to increase the DCS's
robustness to random topological changes and, simultaneously, how to use the available computing
resources of the system efficiently in the presence of communication uncertainty and node
dysfunction  inflicted  by  WMD  attacks  developed  locally  at  UNM. 

Mathematical  model :  In  order  to  describe  the  reliability  of  a  DCS  in  the  presence  of  WMD 
attacks,  we  constructed  a  recursive  model  for  the  execution  time  of  a  workload  served  by  a  DCS 
[2,7,8].  The  model  constructed  predicted  accurately  both  the  execution  time  of  a  workload  and  the 
reliability  in  executing  a  workload.  The  main  assumption  made  in  the  development  of  the  model 
is  that  all  the  random  times  governing  the  dynamics  of  the  system  are  exponentially  distributed. 
This  assumption  was  key  in  order  to  obtain  a  tractable  and  computationally  simple  mathematical 
model  for  the  reliability  [2,7,8]. 

It was shown that, under the assumption of exponentially distributed random times, the configuration
of a DCS can be described using the following quantities: (i) the number of tasks queued
at each node; (ii) the functional or dysfunctional state of each node in the system; and (iii) the
number of tasks in transit over the communication network [2,7,8]. In order to mathematically
describe the state of an $n$-node DCS at any time, we introduced in our analysis three state vectors:
the system-queue, the system-function, and the network state vectors. The system-queue state vector,
$M(t)$, describes the number of tasks queued at each node in the system. The system-function
vector, $F(t)$, is a binary vector specifying the working or failed state of the nodes, while the network
state vector, $C(t)$, describes the number of tasks being transferred to the nodes. In addition to
the state vectors, we introduced in our characterization of the execution time two parameters that
define a task reallocation policy. The first parameter is the reallocation instant, a
non-negative real number, $t_b$, that specifies the instant when the task reallocation should be executed.
The second parameter is the reallocation strength, an $n$-by-$n$ matrix $K$ that
specifies the number of tasks to reallocate between all pairs of nodes. These two parameters answer
the following fundamental questions in distributed computing: (i) When should nodes execute the
task reallocation policy? (ii) How many tasks have to be reallocated among the nodes? and (iii)
Which nodes are appropriate to receive extra work from the other nodes? In this effort, these parameters
were determined by solving a constrained optimization problem that aims to maximize
the service reliability of the DCS.

Next, the execution time of a workload, denoted by $T_K(t_b; M_0, F_0, C_0)$, was defined as the
random time taken by the DCS to serve its entire workload if the task reallocation policy executed
is as specified by $t_b$ and $K$, and the initial system configuration at $t = 0$ is as specified by $M_0 = M(0)$,
$F_0 = F(0)$ and $C_0 = C(0)$. With this, the service reliability can be defined as the probability that all
the tasks are served before all nodes fail, that is, $R_K(t_b; M_0, F_0, C_0) = P\{T_K(t_b; M_0, F_0, C_0) < \infty\}$.
By exploiting the principle of stochastic regeneration, we showed that the service reliability of an
$n$-node DCS satisfies the difference-differential equation (1) and the difference equation (2):

[Equations (1) and (2) are illegible in the source. Equation (1) is the difference-differential
equation in the reallocation instant $t_b$ satisfied by $R_K(t_b; M_0, F_0, C_0)$, and Equation (2) is the
difference equation satisfied by $R_K(0; M_0, F_0, C_0)$, that is, the case $t_b = 0$.]
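The definition of service reliability as $P\{T_K < \infty\}$ can be illustrated numerically in the simplest possible setting. The sketch below is not the model of Equations (1) and (2); it assumes no reallocation at all, exponentially distributed service and failure times, and that tasks queued at a failed node are lost. Under those assumptions (ours, for illustration), each node must empty its own queue before it fails, and the reliability has the standard closed form $\prod_i (\lambda_{d_i}/(\lambda_{d_i}+\lambda_{f_i}))^{m_i}$. The function names and rate values are hypothetical.

```python
import random


def closed_form_reliability(loads, service_rates, failure_rates):
    """P(every node serves its own queue before failing), with no reallocation."""
    reliability = 1.0
    for m, d, f in zip(loads, service_rates, failure_rates):
        # Each task is a race between one service (rate d) and the failure (rate f).
        reliability *= (d / (d + f)) ** m
    return reliability


def simulate_once(loads, service_rates, failure_rates, rng):
    """Return True if the whole workload is served, i.e. the execution time is finite."""
    for m, d, f in zip(loads, service_rates, failure_rates):
        failure_time = rng.expovariate(f)
        busy_time = sum(rng.expovariate(d) for _ in range(m))
        if busy_time > failure_time:
            return False  # the node failed with tasks still queued
    return True


def monte_carlo_reliability(loads, service_rates, failure_rates, runs=20000, seed=1):
    """Estimate the service reliability as the fraction of successful runs."""
    rng = random.Random(seed)
    hits = sum(simulate_once(loads, service_rates, failure_rates, rng)
               for _ in range(runs))
    return hits / runs


if __name__ == "__main__":
    loads, d, f = (5, 3), (1.0, 2.0), (0.05, 0.1)  # hypothetical two-node system
    print(closed_form_reliability(loads, d, f))
    print(monte_carlo_reliability(loads, d, f))
```

Comparing the two numbers is the same kind of check the report performs when it validates the analytical model against MC simulations, only for a far simpler system.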


The model for the service reliability given in Equations (1) and (2) was employed to search
for the optimal task reallocation instant, $t_b^*$, and the optimal task reallocation strength, $K^*$, that
maximize the service reliability. This was performed by solving the constrained optimization
problem

\[ (t_b^*, K^*) = \arg\max_{(t_b, K)} R_K(t_b; M_0, F_0, C_0) \tag{3} \]

subject to $t_b \geq 0$ and $K_{ij} \in [0, 1]$.
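To make the optimization in (3) concrete, the following is a minimal sketch for a simplified two-node system with reallocation at $t_b = 0$. It is written in the spirit of a regeneration (first-step) argument, but it is not the report's Equations (1) and (2). The assumptions are ours: tasks move only from node 1 to node 2 as a single bundle with an exponentially distributed transfer time (rate `z`), the number of tasks moved is `round(K12 * m1)`, and the workload is lost if a node fails while it holds tasks or has tasks in transit to it. All names and parameter values are hypothetical.

```python
from functools import lru_cache


def make_reliability(d1, d2, f1, f2, z):
    """Build R(m1, m2, c): queue lengths at nodes 1 and 2, and c tasks in transit to node 2."""

    @lru_cache(maxsize=None)
    def reliability(m1, m2, c):
        if m1 == 0 and m2 == 0 and c == 0:
            return 1.0  # the whole workload has been served
        # Competing exponential events: (rate, reliability after the event).
        events = []
        if m1 > 0:
            events.append((d1, reliability(m1 - 1, m2, c)))  # node 1 serves a task
            events.append((f1, 0.0))                         # node 1 fails with work queued
        if m2 > 0:
            events.append((d2, reliability(m1, m2 - 1, c)))  # node 2 serves a task
        if m2 > 0 or c > 0:
            events.append((f2, 0.0))                         # node 2 fails with work queued or inbound
        if c > 0:
            events.append((z, reliability(m1, m2 + c, 0)))   # the bundle arrives at node 2
        total_rate = sum(rate for rate, _ in events)
        # First-step analysis: average over which event happens first.
        return sum(rate * value for rate, value in events) / total_rate

    return reliability


def best_strength(m1, m2, rates, grid=11):
    """Grid search over K12 in [0, 1]; returns (best K12, reliability achieved)."""
    reliability = make_reliability(*rates)
    best_k, best_value = None, -1.0
    for step in range(grid):
        k12 = step / (grid - 1)
        moved = round(k12 * m1)
        value = reliability(m1 - moved, m2, moved)
        if value > best_value:
            best_k, best_value = k12, value
    return best_k, best_value


if __name__ == "__main__":
    # Hypothetical rates (d1, d2, f1, f2, z): node 1 is much less reliable than node 2.
    print(best_strength(20, 0, (1.0, 1.0, 0.2, 0.01, 2.0)))
```

A grid search is used here only because it is easy to read; the report instead solves (3) over both $t_b$ and the full matrix $K$, which this sketch does not attempt.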

Algorithm: Solving the optimization problem (3) is computationally expensive for DCSs with
a large number of nodes, as the amount of computation grows exponentially in the number of nodes.
For  example,  it  was  shown  that  the  complexity  in  solving  Equation  (1)  is  bounded  by  0(2"").  To 
avoid  such  complication,  an  algorithm  for  devising  task  reallocation  policies  was  developed,  see 
Appendix A. The key idea was to decompose an $n$-node system into several two-node DCSs and
exploit the exact characterization of the reliability of two-node systems provided by Equations (1) and
(2). The algorithm developed was proven to scale linearly with the number of nodes in the DCS [8].

Figure 1. (a) Theoretical service reliability as a function of the task reallocation strength $K$ for a fixed reallocation
instant. (b) Theoretical, simulated, and experimental service reliability as a function of the task reallocation strength of
node 1 when task reallocation is executed at $t_b = 0$. In the upper plot $K_{21} = 0.25$, while in the lower plot $K_{21} = 0.9$.
(c) Service reliability as a function of the reallocation instant for three representative reallocation strengths.

Key results: Evaluations for a two-node DCS were conducted in order to assess the performance
of the reallocation policies devised by solving (3). Figure 1(a) shows the theoretical service reliability
as a function of the reallocation strengths $K_{12}$ and $K_{21}$. Figure 1(b) shows the theoretical, MC
simulated, and experimental service reliability as a function of $K_{12}$ for a fixed value of $K_{21}$, where
these two parameters are the components of $K$ and govern the intensity of load transfer between the
two nodes. Notably, the theoretical and experimental results showed excellent agreement. From
the theoretical curves, it was observed that the service reliability appears to be a concave function
of the number of tasks to be reallocated among the nodes, that is, a concave function of the
reallocation strength parameter. Figure 1(c) shows the service reliability as a function of the
reallocation instant for three representative selections of the reallocation strength. Again, remarkable
agreement was observed among theoretical prediction, Monte-Carlo (MC) simulation, and experimental results.
From the figure, it was clearly observed that a proper choice of $K$ resulted in an increase of the
service reliability. It was also noted that an incorrect selection of the reallocation strength can be
compensated for by delaying the reallocation action.

The accuracy of the model in predicting the reliability was also assessed. Predictions generated
by the model for reliability were compared with MC simulations where the distributions of the
random times are non-exponential. Via MC simulations, the service reliability of a non-Markovian
DCS was estimated with 95% confidence. The estimated reliability is plotted in Figure 1(a). It
was noted from the figure that the exponential model for reliability is very accurate and yields
a relative approximation error below 4%. Further simulations showed that, as the ratio between
the average transfer time and the average service time of the nodes increases, the exponential
approximation loses its accuracy in predicting the reliability. Specifically, approximation errors of
120% were found when the ratio between the times is five.


Table 1. Service reliability achieved by three reallocation policies, which have different reallocation criteria.
For comparison purposes, the optimal values obtained for each case are listed.

Initial load                         Service reliability
(m_1, ..., m_5)       Max-Service   Proc-Speed   Complete   Optimum
(150,0,0,0,0)         0.509         0.511        0.573      0.631
(0,150,0,0,0)         0.614         0.610        0.617      0.617
(0,0,150,0,0)         0.601         0.591        0.601      0.601
(0,0,0,150,0)         0.583         0.533        0.612      0.615
(0,0,0,0,150)         0.543         0.566        0.613      0.619
(50,30,30,30,30)      0.634         0.603        0.636      0.657
(59,2,4,34,51)        0.556         0.608        0.638      0.668
(18,55,29,27,21)      0.642         0.623        0.640      0.649
(26,30,28,38,28)      0.642         0.639        0.642      0.642
(40,15,40,35,20)      0.624         0.610        0.643      0.656

The effect of the selection of a reallocation criterion on the service reliability was studied in DCSs
with multiple nodes. Three reallocation policies, each having a different reallocation
criterion, were studied. The Maximal-Service strategy triggered a reallocation action when the workload
of the DCS is imbalanced with respect to the relative reliability of the nodes. The Processing-Speed strategy
triggered a reallocation action when the workload is imbalanced with respect to the relative processing
rate of the nodes. Finally, the Complete reallocation strategy triggered a reallocation action when
the workload is imbalanced with respect to the combined processing and failure rates. The results
of the evaluations conducted are listed in Table 1. These results showed that, in
most of the cases, the three policies achieve approximately the same performance, which shows the
strength of the developed approach in modeling and optimizing reliability. The performance of the
optimal task reallocation strategies was evaluated, and it was noticed that the service reliability was
improved up to 65% as compared to the reliability provided by a DCS, and up to 22% as compared
to policies that considered the nodes' reliability but disregarded the communication costs over the
network. Moreover, the algorithm developed in this work to devise task reallocation strategies
achieved values for service reliability within 70% of the optimal service reliability, and in some cases
achieved the optimal value.
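The comparison drawn from Table 1 can be reproduced with a few lines of arithmetic. The sketch below transcribes the table values and computes, for each policy, the mean service reliability over the ten initial loads and the worst-case ratio to the listed optimum. The summary statistics chosen here (mean and worst-case ratio) are our own illustration and are not reported in the source.

```python
# Rows of Table 1: initial load -> (Max-Service, Proc-Speed, Complete, Optimum).
TABLE_1 = {
    (150, 0, 0, 0, 0): (0.509, 0.511, 0.573, 0.631),
    (0, 150, 0, 0, 0): (0.614, 0.610, 0.617, 0.617),
    (0, 0, 150, 0, 0): (0.601, 0.591, 0.601, 0.601),
    (0, 0, 0, 150, 0): (0.583, 0.533, 0.612, 0.615),
    (0, 0, 0, 0, 150): (0.543, 0.566, 0.613, 0.619),
    (50, 30, 30, 30, 30): (0.634, 0.603, 0.636, 0.657),
    (59, 2, 4, 34, 51): (0.556, 0.608, 0.638, 0.668),
    (18, 55, 29, 27, 21): (0.642, 0.623, 0.640, 0.649),
    (26, 30, 28, 38, 28): (0.642, 0.639, 0.642, 0.642),
    (40, 15, 40, 35, 20): (0.624, 0.610, 0.643, 0.656),
}
POLICIES = ("Max-Service", "Proc-Speed", "Complete")
OPTIMUM = 3  # column index of the optimal value


def mean_reliability():
    """Mean service reliability of each policy over all initial loads."""
    rows = list(TABLE_1.values())
    return {name: sum(row[i] for row in rows) / len(rows)
            for i, name in enumerate(POLICIES)}


def worst_ratio_to_optimum():
    """Smallest ratio (policy reliability) / (optimal reliability) over all loads."""
    rows = list(TABLE_1.values())
    return {name: min(row[i] / row[OPTIMUM] for row in rows)
            for i, name in enumerate(POLICIES)}


if __name__ == "__main__":
    for name, value in mean_reliability().items():
        print(f"{name:12s} mean = {value:.4f}")
    for name, value in worst_ratio_to_optimum().items():
        print(f"{name:12s} worst ratio to optimum = {value:.3f}")
```

On these data the Complete policy has the highest mean of the three, and the largest gaps to the optimum occur when the whole load starts at a single node.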

As a result of this work, fundamental tradeoffs and interplays between the different parameters
governing the dynamics of DCSs were identified. First, due to limitations in the communication
infrastructure, there is a tradeoff between delaying a reallocation action (in order to have an accurate
account of the working or failed state of nodes) and immediately executing the reallocation strategy
(to avoid wasting valuable computing time). The mathematical characterization of the reliability
developed during this effort enabled us to optimally select when the reallocation action should be
taken. Second, it was discovered that there is an interplay between the task transfer time and the
idle time of the nodes; it was found that the service reliability can be improved if the idle times of
the nodes are reduced as much as possible. Third, it was found that effective task reallocation
policies must consider the reliability of the nodes as a parameter. Fourth, it was found that
the task-transfer delay must be considered when designing task reallocation strategies.
When task-transfer delays are large relative to the execution time of tasks at the nodes, task
reallocation policies cannot be effective due to the excessive transfer time taken by the tasks in the
network. Finally, it was found that the model for the service reliability yields accurate predictions
in operational conditions where the average task transfer times are less than or approximately equal to
the average service time of tasks.

2.  Probabilistic  Network  Modeling:  The  Network  Functionality 
Executive  Summary 

We have completed the initial phase of construction of a mathematically rigorous formalism to
describe the time evolution of the performance of any network subject to multiple random-in-space-and-time
disruption and restoration events. Our main tool for quantifying network performance
is the notion of functionality, which is roughly the ratio of (software) task execution time when
the network is in an unimpaired state [numerator] to task execution time (for the identical task)
when the network is impaired to some variable degree [denominator]. The initial phase quantifies
the time evolution of the functionalities of individual network components (nodes and links); the
results of the initial phase will then be used in a second phase in which the individual component
behaviors are coalesced to describe the time evolution of the functionality of the network in its
entirety. This model is applicable to network attack in general, and to WMD network attack in
particular.

Publication 

• D. Dietz and M. M. Hayat, “A Model for the Time Evolution of Networks Subject to Random
Multiple Disruption and Restoration Events,” in preparation.

Technical  Summary 

Introduction: We have completed the initial phase of construction of a mathematically rigorous
formalism to describe the time evolution of the functionality of a network whose components are
subject to multiple random-in-space-and-time impairment-causing and subsequent functionality-restoring
events. Many situations may be envisaged for which such a model is applicable. One
such situation of particular recent high interest is the potential attempt by terrorists to impair
or destroy networks, via a WMD attack or otherwise, which form an integral part of civilian or
military infrastructure. The objective of a terrorist attack on such a network is to cause it to
function not at all or only partially for some period of time, ideally “forever” from the attackers'
point of view. Nevertheless, it is to be expected that as time progresses subsequent to an attack,
an affected network will be restored in stages to operability via restoration of the functionality of
some, not necessarily all, of its impaired nodes and links (“components”). It is possible, however,
that as restoration proceeds, additional attacks upon the network take place which serve to impair
previously unaffected components as well as to re-impair some previously-impaired-but-restored
components. Any network component, then, is subject in general to a finite time sequence of
impairment and restoration events as the consequence of a finite time sequence of attacks; and
the time evolution of the functionality of the network as a whole, being describable in terms of
the collection of time evolutions of all of its individual components, is then also subject to such a
sequence.

In order, therefore, to describe the time evolution of the functionality of a network subject to
the conditions described above, we have constructed the initial portion (addressing only individual
component behavior) of a model of network behavior which provides quantitative predictions of
network (partial) functionality as each of the network components progresses in time through its
own sequence of impairment and subsequent restoration events. Since the timing of the occurrences
of the impairment events and the amount of time required to subsequently restore the components
are both stochastic, this model is also a stochastic one. This model consists of five basic concepts,
namely (1) states, (2) functionality, (3) state and functionality time histories, (4) software task
completion time, and (5) probability spaces for state and functionality time histories. We now
discuss each of these briefly in turn in the context of a network consisting of $C$ components (nodes
and links) labeled by $\mu = 1, \ldots, C$.

Technical Details

States. A component labeled $\mu$ may, at any given instant, be in one of several possible attainable
component states labeled by $\sigma_0^{(\mu)}, \sigma_1^{(\mu)}, \ldots, \sigma_{Q_\mu}^{(\mu)}$, where $Q_\mu$ is a positive integer, and we write
$\Sigma^{(\mu)} = \{\sigma_0^{(\mu)}, \sigma_1^{(\mu)}, \ldots, \sigma_{Q_\mu}^{(\mu)}\}$ for the abstract set of all such states. The state $\sigma_0^{(\mu)}$ always represents
the totally nonoperational condition of the component (i.e., being “dead”) while the state $\sigma_{Q_\mu}^{(\mu)}$
always represents its totally operational condition (i.e., being capable of executing any appropriate
assigned task at full performance capacity if so called upon); the states $\sigma_1^{(\mu)}, \ldots, \sigma_{Q_\mu - 1}^{(\mu)}$ allow for
component intermediate levels of (partial) operationality between the two extremes and are labeled
in no particular order. The network may then, at any given instant, be in one of several possible
network states belonging to the set $\Sigma^{[C]} = \{\sigma_0, \sigma_1, \ldots, \sigma_{Q^{[C]}}\}$ of all attainable
network states, where $Q^{[C]} + 1 = (Q_1 + 1)(Q_2 + 1) \cdots (Q_C + 1)$; also, $\sigma_0 = (\sigma_0^{(1)}, \sigma_0^{(2)}, \ldots, \sigma_0^{(C)})$
(the totally nonfunctional network state) and $\sigma_{Q^{[C]}} = (\sigma_{Q_1}^{(1)}, \sigma_{Q_2}^{(2)}, \ldots, \sigma_{Q_C}^{(C)})$ (the totally functional
network state).

Functionality. The functionality, $\mathcal{F}$, of a component or a network, given a
software task to be executed upon it, is defined notionally as the ratio

\[ \mathcal{F} = \frac{\text{Network task completion time under unimpaired conditions}}{\text{Network task completion time under impaired conditions}}. \]

Functionality is task dependent and in general $0 \le \mathcal{F} \le 1$; for example, $\mathcal{F} = 1/2$ corresponds
to a task execution slowdown by a factor of two. Further, for a component task, labeled say by $\mathcal{T}$,
we assign a numerical task dependent component state functionality $\varphi_j^{(\mu)}(\mathcal{T})$ to each of the states
$\sigma_j^{(\mu)}$ of $\Sigma^{(\mu)}$ via the component state functionality mapping $F_\mathcal{T}^{(\mu)} : \Sigma^{(\mu)} \to [0, 1]$ according to
$F_\mathcal{T}^{(\mu)}(\sigma_j^{(\mu)}) = \varphi_j^{(\mu)}(\mathcal{T})$, $j = 0, \ldots, Q_\mu$, where $F_\mathcal{T}^{(\mu)}(\sigma_0^{(\mu)}) = 0 = \varphi_0^{(\mu)}$ and $F_\mathcal{T}^{(\mu)}(\sigma_{Q_\mu}^{(\mu)}) = 1 = \varphi_{Q_\mu}^{(\mu)}$. We
denote the set of numerical component state functionalities by $\Phi_\mathcal{T}^{(\mu)} \subset [0, 1]$. The set $\{\varphi_j^{(\mu)}(\mathcal{T})\}_{j=0}^{Q_\mu}$
of state functionality values must be supplied externally to the model.

Similarly, for a network task labeled by $\mathcal{T}$, we will assign a numerical task dependent network
state functionality $\varphi_j(\mathcal{T})$ to each of the states $\sigma_j$ of $\Sigma^{[C]}$ via the network state functionality
mapping $F_\mathcal{T}^{[C]} : \Sigma^{[C]} \to [0, 1]$ according to $F_\mathcal{T}^{[C]}(\sigma_j) = \varphi_j(\mathcal{T})$, $j = 0, \ldots, Q^{[C]}$, where
$F_\mathcal{T}^{[C]}(\sigma_0) = 0 = \varphi_0$ and $F_\mathcal{T}^{[C]}(\sigma_{Q^{[C]}}) = 1 = \varphi_{Q^{[C]}}$. We will denote the set of numerical network
state functionalities by $\Phi_\mathcal{T}^{[C]} \subset [0, 1]$. The set $\{\varphi_j(\mathcal{T})\}_{j=0}^{Q^{[C]}}$ of state functionality values must be
supplied externally to the model (and implicitly incorporate the network topology). These values
may be computed in general by using a network performance code (e.g., OPNET).
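The counting rule $Q^{[C]} + 1 = (Q_1 + 1)(Q_2 + 1) \cdots (Q_C + 1)$ follows from a network state being a tuple of component states. A minimal sketch of that enumeration is given below; the encoding of component states as the integers $0, \ldots, Q_\mu$ and the example values of $Q_\mu$ are assumptions made for illustration. The network state functionality values themselves are not computed here, since the report states they must be supplied externally.

```python
from itertools import product
from math import prod


def network_states(levels):
    """Enumerate network states as tuples of component states.

    levels[mu] = Q_mu, so component mu has states 0..Q_mu
    (0 = totally nonoperational, Q_mu = totally operational).
    """
    return list(product(*(range(q + 1) for q in levels)))


# Hypothetical network: two binary components and one with a single partial state.
levels = [1, 2, 1]
states = network_states(levels)

print(len(states))              # number of attainable network states
print(states[0], states[-1])    # totally nonfunctional and totally functional states
```

The state count grows as a product over components, which is one reason the functionality values for network states are expected to come from a separate network performance code.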

Time Histories. In general, component $\mu$ proceeds in time through a sequence of states from
$\Sigma^{(\mu)}$ as determined by the sequence of impairment and restoration events that it experiences. We
represent this time evolution of the component instantaneous state by a function $h^{(\mu)} : [0, \infty) \to
\Sigma^{(\mu)}$, termed a component instantaneous state time history, and defined by $h^{(\mu)}(t) = \sigma_j^{(\mu)}$ if the
component is in state $\sigma_j^{(\mu)}$ at time instant $t$, $j \in \{0, \ldots, Q_\mu\}$. The function $f^{(h^{(\mu)}, \mathcal{T})} = F_\mathcal{T}^{(\mu)} \circ h^{(\mu)} :
[0, \infty) \to \Phi_\mathcal{T}^{(\mu)}$, termed a component instantaneous functionality time history, represents the time
evolution of component instantaneous functionality when the component's state time history is $h^{(\mu)}$
and the component task is $\mathcal{T}$.

Similarly, a network proceeds in time through a sequence of states of $\Sigma^{[C]}$ as determined by
the sequences of impairment and restoration events that its components experience. This time
evolution of the network instantaneous state will be represented by a function $h : [0, \infty) \to \Sigma^{[C]}$,
termed a network instantaneous state time history, defined by $h(t) = (h^{(1)}(t), \ldots, h^{(C)}(t))$;
indeed, $h(t) = \sigma_j$ if the network is in state $\sigma_j$ at time $t$, $j \in \{0, \ldots, Q^{[C]}\}$. The function
$f^{(h, \mathcal{T})} = F_\mathcal{T}^{[C]} \circ h : [0, \infty) \to \Phi_\mathcal{T}^{[C]}$, termed a network instantaneous functionality time history,
will represent the time evolution of network instantaneous functionality when the network's state
time history is $h$ and the network task is $\mathcal{T}$.
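A state time history and its composition with a state functionality mapping can be sketched as follows, assuming the history is piecewise constant with finitely many change instants (consistent with the finite sequences of impairment and restoration events described above). The state labels, change times, and functionality values are hypothetical.

```python
from bisect import bisect_right


def make_state_history(change_times, states):
    """Return h(t) for a piecewise-constant state time history.

    states[k] holds on [change_times[k], change_times[k + 1]);
    change_times must be increasing and start at 0.
    """
    def h(t):
        return states[bisect_right(change_times, t) - 1]
    return h


def compose(functionality_of_state, h):
    """Return f = F o h, the instantaneous functionality time history."""
    return lambda t: functionality_of_state[h(t)]


# Hypothetical component with states 0 (dead), 1 (partial), 2 (fully operational).
F = {0: 0.0, 1: 0.4, 2: 1.0}
# Fully operational until t = 3, dead until t = 5, partially restored until t = 8.
h = make_state_history([0.0, 3.0, 5.0, 8.0], [2, 0, 1, 2])
f = compose(F, h)

print([f(t) for t in (1.0, 3.0, 6.0, 10.0)])
```

Keeping `h` and `F` separate mirrors the formalism: the same state history yields a different functionality history for each task, because only the mapping changes.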

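Task Completion Time. We define the component history task completion time for a component
software task, $\mathcal{T}$, running on a single computational network node (links do not participate
here), which node is running only that task, and which node, in addition, has associated histories $h^{(\mu)}$
and $f^{(h^{(\mu)}, \mathcal{T})}$. Let $\Delta_\mathcal{T}^0 > 0$ denote the length of the time interval (duration) required for the entire task,
$\mathcal{T}$, to run from start to completion on the node if the node is forever unimpaired, that is, always in state
$\sigma_{Q_\mu}^{(\mu)}$, thus having state history $h^{(\mu)} \equiv \sigma_{Q_\mu}^{(\mu)}$ (i.e., $h^{(\mu)}(t) = \sigma_{Q_\mu}^{(\mu)}$ for all $t \ge 0$) and state functionality
history $f^{(h^{(\mu)}, \mathcal{T})} \equiv 1$. $\Delta_\mathcal{T}^0$ is assumed to be time translation invariant, that is, independent of the task starting
instant, denoted $t_0$, on the forever-unimpaired node. In our model, any task, $\mathcal{T}$, being
executed on the node is completely characterized by its $\Delta_\mathcal{T}^0$ (plus its starting instant $t_0$ on the node). The
component history task completion time (duration) for the node having time histories $h^{(\mu)}$ and
$f^{(h^{(\mu)}, \mathcal{T})}$ is defined by

\[ c_{[\Delta_\mathcal{T}^0;\, t_0]}\big(f^{(h^{(\mu)}, \mathcal{T})}\big) = \inf\{\tau > 0 \mid \Gamma_{f^{(h^{(\mu)}, \mathcal{T})}}(t_0; \tau) = \Delta_\mathcal{T}^0\} \tag{5} \]

where

\[ \Gamma_{f^{(h^{(\mu)}, \mathcal{T})}}(t_0; \tau) = \int_{t_0}^{t_0 + \tau} f^{(h^{(\mu)}, \mathcal{T})}(t)\, dt. \tag{6} \]

The mean functionality, taken over the entire task execution time interval, of node instantaneous
functionality history $f^{(h^{(\mu)}, \mathcal{T})}$ is

\[ \bar{\varphi}_{[\Delta_\mathcal{T}^0;\, t_0]}\big(f^{(h^{(\mu)}, \mathcal{T})}\big) = \frac{1}{c_{[\Delta_\mathcal{T}^0;\, t_0]}\big(f^{(h^{(\mu)}, \mathcal{T})}\big)} \int_{t_0}^{t_0 + c_{[\Delta_\mathcal{T}^0;\, t_0]}(f^{(h^{(\mu)}, \mathcal{T})})} f^{(h^{(\mu)}, \mathcal{T})}(t)\, dt = \frac{\Delta_\mathcal{T}^0}{c_{[\Delta_\mathcal{T}^0;\, t_0]}\big(f^{(h^{(\mu)}, \mathcal{T})}\big)}, \tag{7} \]

where we take $\bar{\varphi}_{[\Delta_\mathcal{T}^0;\, t_0]}(f^{(h^{(\mu)}, \mathcal{T})}) = 0$ whenever $c_{[\Delta_\mathcal{T}^0;\, t_0]}(f^{(h^{(\mu)}, \mathcal{T})}) = \infty$. This yields precisely the
same value as that of functionality $\mathcal{F}$ defined by the ratio in words given earlier.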
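For a piecewise-constant functionality history, the completion time (the smallest $\tau$ at which the accumulated integral of $f$ reaches $\Delta_\mathcal{T}^0$) and the mean functionality can be computed directly. The sketch below assumes such a piecewise-constant history; the function names and example values are hypothetical and are not taken from the report.

```python
import math


def completion_time(breaks, values, delta0, t0=0.0):
    """Smallest tau such that the integral of f over [t0, t0 + tau] equals delta0.

    f equals values[k] on [breaks[k], breaks[k + 1]) and values[-1] after
    breaks[-1]; breaks must be increasing and start at 0.
    Returns math.inf if the task can never complete.
    """
    done = 0.0  # unimpaired-equivalent work accumulated so far
    for k, v in enumerate(values):
        end = breaks[k + 1] if k + 1 < len(breaks) else math.inf
        if end <= t0:
            continue  # this piece lies entirely before the task starts
        start = max(breaks[k], t0)
        if v > 0:
            needed = (delta0 - done) / v  # time needed at this functionality level
            if start + needed <= end:
                return start + needed - t0
        if end == math.inf:
            return math.inf  # last piece has zero functionality
        done += v * (end - start)
    return math.inf


def mean_functionality(delta0, c):
    """Mean functionality over the execution interval; 0 if the task never completes."""
    return 0.0 if math.isinf(c) else delta0 / c


if __name__ == "__main__":
    # Fully functional for 2 time units, then half functional forever.
    c = completion_time([0.0, 2.0], [1.0, 0.5], delta0=4.0)
    print(c, mean_functionality(4.0, c))
```

In the example, two units of work are done at full speed and the remaining two at half speed, so the task takes six time units instead of four, and the mean functionality is the ratio of the unimpaired to the impaired completion time.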

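Similarly, we will define the network history task completion time for a network software task,
$\mathcal{T}$, running entirely on the network, which network is running only that task (or collection of tasks)
and which network, in addition, has associated histories $h$ and $f^{(h, \mathcal{T})}$. Also, we will let $\Delta_\mathcal{T}^0 > 0$
denote the length of the time interval (duration) required for the entire task, $\mathcal{T}$, to run from start to
completion on the network if the network is forever unimpaired, that is, always in state $\sigma_{Q^{[C]}}$, thus having
state history $h \equiv \sigma_{Q^{[C]}}$ and state functionality history $f^{(h, \mathcal{T})} \equiv 1$. $\Delta_\mathcal{T}^0$ is assumed to be time
translation invariant, that is, independent of the task starting instant, denoted $t_0$, on the forever-unimpaired
network. In our network model, any network task, $\mathcal{T}$, being executed on the network will be
completely characterized by its $\Delta_\mathcal{T}^0$ (plus its starting instant $t_0$). The network history task completion
time (duration) for the network having time histories $h$ and $f^{(h, \mathcal{T})}$ and task starting instant
$t_0 \ge 0$ will be defined by

\[ c_{[\Delta_\mathcal{T}^0;\, t_0]}\big(f^{(h, \mathcal{T})}\big) = \inf\{\tau > 0 \mid \Gamma_{f^{(h, \mathcal{T})}}(t_0; \tau) = \Delta_\mathcal{T}^0\} \]

where

\[ \Gamma_{f^{(h, \mathcal{T})}}(t_0; \tau) = \int_{t_0}^{t_0 + \tau} f^{(h, \mathcal{T})}(t)\, dt. \]

The mean functionality, taken over the entire task execution time interval, of network instantaneous
functionality history $f^{(h, \mathcal{T})}$ will be

\[ \bar{\varphi}_{[\Delta_\mathcal{T}^0;\, t_0]}\big(f^{(h, \mathcal{T})}\big) = \frac{1}{c_{[\Delta_\mathcal{T}^0;\, t_0]}\big(f^{(h, \mathcal{T})}\big)} \int_{t_0}^{t_0 + c_{[\Delta_\mathcal{T}^0;\, t_0]}(f^{(h, \mathcal{T})})} f^{(h, \mathcal{T})}(t)\, dt = \frac{\Delta_\mathcal{T}^0}{c_{[\Delta_\mathcal{T}^0;\, t_0]}\big(f^{(h, \mathcal{T})}\big)}, \]

where we take $\bar{\varphi}_{[\Delta_\mathcal{T}^0;\, t_0]}(f^{(h, \mathcal{T})}) = 0$ whenever $c_{[\Delta_\mathcal{T}^0;\, t_0]}(f^{(h, \mathcal{T})}) = \infty$. This yields precisely the same
value as that of functionality $\mathcal{F}$ defined by the ratio in words given earlier.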

Probability  Spaces.  As  was  pointed  out  in  the  Introduction,  both  the  timing  of  the  occurrences  of 
impairment  events  in  a  network,  via  impairment  of  its  components,  and  the  amount  of  time  required 
to  (partially)  restore  t  he  network,  via  restoration  of  impaired  components,  are  stochastic.  In  other 
words,  each  possible  (state  or  functionality)  history  that  a  given  component,  hence  network,  may 
experience  as  a.  result  of  a  sequence  of  impairments  and  subsequent  restorations  must  be  treated  as  a, 
member  of  an  appropriate  probability  space.  Three  families  of  such  probability  spaces  are  required 
in  our  formalism:  one  family,  {(Ti^  .  )}^_ , ,  of  probability  spaces  for  component  state 

histories  associated  with  the  collection  of  all  the  components  which  the  network  comprises;  a 
second,  (TfE,  A E,  iP£),  for  network  state  histories;  and  a  third  family,  {('H*"r,  A?x .  'P<t‘T)}r> 
for  network  functionality  histories  indexed  by  task  label  T.  The  third  family  is  in  fact  induced  by 
the  second  which  is  in  turn  induced  by  the  first.  We  have  constructed  the  first  family  under  the 
current effort; the second and third families will be constructed under a follow-on effort. We point
out  that  the  probability  measures  1  are  perfectly  arbitrary  and  are  to  be  inferred  from  the 
conditions imposed by any given network attack/restoration scenario. Also, in this setting, $C_{[\Lambda_T;\,t_0]}$
and $\mathcal{F}_{[\Lambda_T;\,t_0]}$ are both random variables. We may then compute distributions and expectations (as
well  as  higher  moments)  of  these  random  variables  with  respect  to  probability  assignment 
to  quantify  partially-impaired /partially-restored  component  effectiveness  vis-a-vis  any  component 
task.  Furthermore,  in  this  setting,  and  y.^  are  both  to  be  random  variables  in  (H*T , 

A*r,  VPr  );  we  will  then  be  able  to  compute  distributions  of  these  with  respect  to  probability 
assignment to quantify partially-impaired/partially-restored network effectiveness.

3.  Network  Health  Assessment 

Executive  Summary 

In  this  part,  we  consider  a  network  that  is  trying  to  reach  consensus  on  the  occurrence  of  a  WMD 
attack,  by  communicating  over  Additive  White  Gaussian  Noise  (AWGN)  channels.  We  develop  the 
mathematical  foundations  of  networked  binary  consensus  over  noisy  links.  We  first  consider  the  case 
where  the  nodes  do  not  have  any  knowledge  of  link  qualities.  We  show  that  the  asymptotic  behavior 
in  the  presence  of  any  amount  of  non-zero  communication  noise  becomes  unfavorable  as  the  network 
loses the memory of the initial state. However, we show that the network can still reach and stay
in  accurate  consensus  for  a  long  period  of  time.  In  order  to  characterize  this,  we  derive  a  tight 
approximation for the second largest eigenvalue of the network and show how it is related to the size
of  the  network  and  communication  noise  variance.  We  then  consider  the  case  where  knowledge  of 
the  corresponding  link  qualities  is  available  at  every  receiving  node.  We  extend  our  framework  and 
propose  novel  soft  information  processing  approaches  to  improve  the  performance  in  the  presence  of 
noisy  links.  We  show  that,  by  learning  the  voting  patterns,  we  can  solve  the  undesirable  asymptotic 
behavior  of  binary  consensus.  We  furthermore  characterize  the  impact  of  network  connectivity  on 
consensus performance. Finally, we show the underlying tradeoffs between robustness to link error
and  optimization  of  information  flow  that  arise  in  networked  binary  consensus  over  noisy  links. 

Our achievements are motivated by and directly related to a network that is under a WMD
attack. In case of such an attack, the knowledge available for proper, timely and accurate detection
of it is limited, sparse and prone to errors. Distributed assessment of network health in the
presence of uncertainties therefore becomes important, which is the main motivation for our
framework. While there exist several works on estimation consensus problems, detection
consensus, as relevant to WMD attacks, has not received much attention in the literature. Our
proposed framework has therefore pushed the boundaries of what was known in the literature.

Personnel  Supported 

The  following  personnel  worked  on  problems  related  to  this  portion  of  the  grant: 

• Graduate Students: Alejandro Gonzalez Ruiz (Fulbright scholar), Yongxian Ruan

• Faculty: Yasamin Mostofi (Co-PI)

Publications 

• Y. Mostofi, “Binary Consensus with Gaussian Communication Noise: A Probabilistic Approach,”
Proceedings of the 46th IEEE Conference on Decision and Control (CDC), Dec. 2007.

• Y. Ruan and Y. Mostofi, “Binary Consensus with Soft Information Processing in Cooperative
Networks,” invited paper, 47th IEEE Conference on Decision and Control (CDC), 2008.

• Y. Ruan and Y. Mostofi, “Impact of Link Qualities and Network Topology on Binary Consensus,”
American Control Conference (ACC), 2009.

• M. Malmirchegini, Y. Ruan and Y. Mostofi, “Binary Consensus Over Fading Channels: A
Best Affine Estimation Approach,” IEEE Globecom, 2008.

• A. Gonzalez Ruiz and Y. Mostofi, “Distributed Load Balancing over Directed Network Topologies,”
best paper in the session, American Control Conference (ACC), 2009.

Technical  Summary 

Early and accurate detection of the presence of WMD stressors is crucial to robust recovery of the
network. The knowledge available for such detection, however, is limited and sparse. Each node
will make a detection decision based on the information that is available to it. The knowledge
available at each node, however, is limited and could be corrupted if part of the network is already
compromised by the attack. Furthermore, link qualities can be far from ideal due to natural
phenomena such as noise or as a result of a WMD attack. Therefore, the network should rely
on collective information processing to make a more accurate decision on the detection of WMD
stressors. Consensus problems arise when a group of distributed nodes need to reach an agreement
on the value of a parameter of interest and can be categorized into two main groups: Estimation
Consensus and Detection Consensus. Estimation consensus refers to the problems where the parameter
of interest can take values over an infinite set or an unknown finite set. In general, such
problems have received considerable attention in the literature [9,10] (except for the
impact of uncertainties on such problems, which has received less attention). By detection consensus,
on the other hand, we refer to the problems in which the parameter of interest takes values
from a finite known set. Then the update protocol that each agent will utilize becomes non-linear.
We refer to the subset of detection consensus problems where the network is trying to reach an
agreement over a parameter that can only have two values as binary consensus [5]. For instance,
networked detection of a WMD attack falls into this category. While there exists a rich literature
on estimation consensus, detection consensus problems only recently started to receive attention,
with our work being one of the early ones in this area. In this part, we discuss our main results
along this line.

Consider $M$ agents that want to reach consensus on the occurrence of a WMD attack. Each
agent makes a decision on the occurrence of the event, based on its one-time local sensor measurement.
Let $b_i(0) \in \{0, 1\}$ represent the initial decision of the $i$th agent, at time step $k = 0$, based on
its local measurement. $b_i(0) = 1$ indicates that the agent votes that the event occurred, whereas
$b_i(0) = 0$ denotes otherwise. Each agent sends its vote to its neighbors, using only one bit of
information, and revises its vote based on the received information. Each transmission gets corrupted
by the receiver noise, which is best modeled as Additive White Gaussian Noise (AWGN channel).
Let $w_{j,i}(k)$ represent the noise at the $k$th time step in the transmission of the information from the
$j$th node to the $i$th one. $w_{j,i}(k)$ is a zero-mean Gaussian random variable with variance $\sigma^2$.

In this research effort, we consider a connected undirected time-invariant graph. Consider the
case where the $j$th agent can communicate (albeit over a noisy link) to the $i$th one. Let random variable $\hat{b}_{j,i}(k)$
represent the reception of the $i$th agent from the transmission of the $j$th one at the $k$th time step
for $k \ge 0$. We will have

$$\hat{b}_{j,i}(k) = b_j(k) + w_{j,i}(k), \quad \text{for } j \in \Psi_i, \tag{8}$$

where $\Psi_i$ represents the set of those agents that can communicate to the $i$th one (including itself)
and $w_{i,i} = 0$. We have

$$\Psi_i = \{z_i(1), z_i(2), \ldots, z_i(N_i)\} \quad \text{for } z_i(j) \in \{1, 2, \ldots, M\}, \tag{9}$$

where $z_i(j) \ne z_i(j')$ for $j \ne j'$, $1 \le j, j' \le N_i$, and $1 \le N_i \le M$ represents the size of $\Psi_i$. Each

agent will then update its vote based on the received information as follows:

$$b_i(k+1) = F\big(\hat{b}_{z_i(1),i}(k), \ldots, \hat{b}_{z_i(N_i),i}(k)\big), \quad \text{for } z_i(j) \in \Psi_i \text{ and } 1 \le j \le N_i, \tag{10}$$

where $\hat{b}_{i,i} = b_i$ and $F(\cdot)$ represents a decision-making function. It should be noted that $b_i(k+1)$ is a
random variable through its dependency on the reception noise. As part of this research effort, we
consider different ways of building the decision-making function based on the knowledge available
on link qualities. Let $\Omega_M = \{0, 1, 2, \ldots, M\}$ represent a finite set and

$$S(k) = \sum_{j=1}^{M} b_j(k) \in \Omega_M. \tag{11}$$

$S(k)$ is a measure of the closeness to consensus at time step $k$.

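Definition: We say that the network is in an accurate consensus state at the $k$th time step if
and only if the following holds:

$$\text{if } S(0) \ge \lceil M/2 \rceil \Rightarrow \forall i,\ b_i(k) = 1; \qquad \text{if } S(0) < \lceil M/2 \rceil \Rightarrow \forall i,\ b_i(k) = 0. \tag{12}$$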
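The vote exchange of Eqs. (8)-(10) and the accurate-consensus test can be sketched as a small simulation. This is a minimal sketch under stated assumptions: the graph is fully connected, the decision-making function $F$ averages the receptions and thresholds at 0.5, and a tie in the initial votes counts as a majority for 1. The function names are hypothetical.

```python
import math
import random


def dec(x):
    """Binary 0-1 decision: 1 if x >= 0.5, else 0."""
    return 1 if x >= 0.5 else 0


def consensus_step(votes, sigma, rng):
    """One synchronous update: each agent averages its own vote and the
    noisy receptions b_j + w_{j,i} from all other agents, then thresholds."""
    m = len(votes)
    new_votes = []
    for i in range(m):
        total = 0.0
        for j in range(m):
            # An agent hears its own vote without noise (w_{i,i} = 0).
            noise = 0.0 if j == i else rng.gauss(0.0, sigma)
            total += votes[j] + noise
        new_votes.append(dec(total / m))
    return new_votes


def is_accurate_consensus(initial_votes, votes):
    """True if every agent holds the initial majority value
    (ties count as 1, an assumption of this sketch)."""
    threshold = math.ceil(len(initial_votes) / 2)
    majority = 1 if sum(initial_votes) >= threshold else 0
    return all(v == majority for v in votes)


rng = random.Random(0)
start = [1, 1, 1, 0, 0]
# With nearly noise-free links, one exchange is enough to agree on 1.
after = consensus_step(start, sigma=1e-6, rng=rng)
```

Raising `sigma` in this sketch makes individual updates flip at random, which is the effect the probabilistic analysis in the rest of this part quantifies.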

Due to the presence of Gaussian noise, there is no guarantee of reaching or staying in consensus. In
other words, there is no transition point beyond which consensus is reached permanently. Instead,
we are interested in evaluating the probability of reaching and staying in different states
of the network and the asymptotic probabilistic behavior of the system, as well as relating these quantities
to the noise variance, size of the network and network connectivity.

3.1 Binary Consensus over a Fully-Connected Graph - Case of Unknown Link Quality

In order to characterize the impact of noisy links on binary consensus on a WMD attack, we first
consider consensus over a fully-connected network, where there is a link, albeit noisy, from every
node to every other node in the network. Under the full-connectivity assumption, we will have the
following for $k \ge 0$:

$$b_i(k+1) = \mathrm{Dec}\Big(\frac{1}{M}\sum_{j=1}^{M}\hat{b}_{j,i}(k)\Big) = \mathrm{Dec}\Big(\frac{1}{M}\sum_{j=1}^{M} b_j(k) + \frac{1}{M}\sum_{j=1,\,j\ne i}^{M} w_{j,i}(k)\Big), \tag{13}$$

where $\mathrm{Dec}(\cdot)$ represents a decision function for binary 0-1 detection:

$$\mathrm{Dec}(x) = \begin{cases} 1 & x \ge .5 \\ 0 & x < .5 \end{cases}$$

Let $\Pi_n(k)$ represent the probability that $n \in \Omega_M$ of the agents are voting 1 at the $k$th time step:

$$\Pi_n(k) = \mathrm{Prob}[S(k) = n], \quad n \in \Omega_M,\ k \ge 0. \tag{14}$$

Let $P_{n,m}$ represent the probability of the network going from state $n$ to state $m$ in one time step,

$$P_{n,m} = \mathrm{Prob}[S(k+1) = m \mid S(k) = n], \quad n, m \in \Omega_M. \tag{15}$$

More specifically, $P_{n,m}$ will have a binomial distribution as follows,

$$P_{n,m} = \binom{M}{m}\kappa_{n,M}^{m}(1 - \kappa_{n,M})^{M-m}, \tag{16}$$

where $\kappa_{n,M}$ represents the probability that any agent votes 1 in the next time step, given a current
state of $n$ ($S(k) = n$). This probability is the same for all the agents. Consider the $i$th agent. We
will have:

$$\kappa_{n,M} = \mathrm{Prob}\Big[\frac{n}{M} + \frac{1}{M}W_i(k) \ge .5\Big], \tag{17}$$

where $W_i(k) = \sum_{j=1,\,j\ne i}^{M} w_{j,i}(k)$ has a Gaussian distribution with zero mean and variance $\sigma_W^2 = (M-1)\sigma^2$. In vector form, with $\Pi(k) = [\Pi_0(k), \ldots, \Pi_M(k)]^T$ and $P = [P_{n,m}]$, we have

$$\Pi(k+1) = P^T\Pi(k). \tag{18}$$

Remark 1. It can be easily confirmed, using the Gersgorin disk theorem [3], that the eigenvalues of
$P$ are located in the following area: $\bigcup_{n=0}^{M}\{z \in \mathbb{C} : |z - P_{n,n}| \le 1 - P_{n,n}\}$. Let $\lambda_{0,fc}, \lambda_{1,fc}, \ldots, \lambda_{M,fc}$
represent the eigenvalues of matrix $P$ in a decreasing order: $|\lambda_{0,fc}| \ge |\lambda_{1,fc}| \ge \ldots \ge |\lambda_{M,fc}|$. This
implies that $\forall n$, $|\lambda_{n,fc}| \le 1$.

Remark 2. Assume $\sigma \ne 0$. Then $P > 0$ (element-wise). From the stochastic property of matrix $P$,
we know that one is an eigenvalue. From Remark 1 and applying Perron's theorem [3], we will
have:

a) $\lambda_{0,fc} = 1$ as a simple eigenvalue of $P$,
b) $|\lambda_{n,fc}| < 1$ for $n \ne 0$ and
c) $[P^T]^k \to L$ as $k \to \infty$, where $L = ZZ_{left}^T$; $Z = P^TZ$, $Z_{left} = PZ_{left}$ and $Z^TZ_{left} = 1$.

Lemma 1. Matrix $P$ is centro-symmetric, i.e. the $(M - n)$th row (or column) of matrix $P$ is a
reverse repeated version of the $n$th row (or column) for $n \in \Omega_M$: $P_{n,m} = P_{M-n,M-m}$ for $n, m \in \Omega_M$.

Proof:  See  our  proof  in  [5]. 
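The transition matrix of Eq. (16) and the properties stated in Remark 2 and Lemma 1 can be checked numerically. The sketch below assumes that Eq. (17) evaluates to $\kappa_{n,M} = Q\big((M/2 - n)/(\sigma\sqrt{M-1})\big)$, which follows from $W_i(k)$ being zero-mean Gaussian with variance $(M-1)\sigma^2$; the function names are hypothetical.

```python
import math


def q_function(x):
    """Gaussian tail probability Q(x) = Prob[N(0,1) > x]."""
    return 0.5 * math.erfc(x / math.sqrt(2.0))


def kappa(n, m_agents, sigma):
    """Prob[n/M + W/M >= 0.5] with W ~ N(0, (M-1) sigma^2)."""
    return q_function((0.5 * m_agents - n) / (sigma * math.sqrt(m_agents - 1)))


def transition_matrix(m_agents, sigma):
    """(M+1) x (M+1) matrix with binomial rows, P[n][m] as in Eq. (16)."""
    rows = []
    for n in range(m_agents + 1):
        k = kappa(n, m_agents, sigma)
        rows.append([math.comb(m_agents, m) * k**m * (1.0 - k) ** (m_agents - m)
                     for m in range(m_agents + 1)])
    return rows


P = transition_matrix(4, 0.5)
# Largest deviation from centro-symmetry, P[n][m] - P[M-n][M-m].
asymmetry = max(abs(P[n][m] - P[4 - n][4 - m])
                for n in range(5) for m in range(5))
```

For $M = 4$ and $\sigma = 0.5$ every row sums to one, every entry is strictly positive, and `asymmetry` is at the level of floating-point rounding.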

If $\sigma \ne 0$, from Remark 2 we know that the network reaches a steady state asymptotically.
Furthermore, we will have $\lim_{k\to\infty}\Pi(k) = ZZ_{left}^T\Pi(0)$, where $Z$ and $Z_{left}$ are as defined in Remark 2.

Remark 3. Consider $ZZ_{left}^T\Pi(0)$, the asymptotic value of vector $\Pi$. $\Pi(0)$ has exactly one element
equal to one and the rest zero, while $Z_{left}$ is a vector whose elements are all the same. Then, $Z_{left}^T\Pi(0)$
loses the information of the initial state. Therefore, the asymptotic value will be independent of
the initial state.

Remark 3 shows that, asymptotically, the information of the initial state will be lost in the
presence of any amount of non-zero communication noise. However, the system can still reach and
stay in accurate consensus for a long period of time (which could be enough for practical purposes).
Fig. 2 shows $\Pi(k)$ as a function of time step, $k$, for $M = 4$ agents. Initially, 3 out of 4 nodes vote
one. Then the system is in accurate consensus at time step $k$ if all nodes are voting one.
The solid line shows the probability of having an accurate consensus ($\Pi_4(k)$). It can be seen that
for a good amount of time (enough for practical purposes), the system will be in accurate consensus
with high probability. It should be noted that the link noise is rather high for this example.
Fig. 3 shows the impact of link quality on binary consensus over
a fully-connected graph of 4 nodes, where 3 out of 4 vote 1 initially. In general, the smaller the
communication noise variance is, the higher the probability of reaching and maintaining an accurate
consensus will be, as can be seen from Fig. 3. The second largest eigenvalue ($\lambda_{1,fc}$) plays a key
role in determining how fast the network is approaching its steady state. The closer the second
eigenvalue is to the unit circle, the slower the rate of convergence, which results in consensus for a
longer period of time. We will next derive an expression for the second largest eigenvalue of matrix
$P$.
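The behavior described here, a long stay in accurate consensus followed by loss of memory of the initial state, can be reproduced by iterating $\Pi(k+1) = P^T\Pi(k)$. The sketch below uses the setting quoted for Fig. 2 ($M = 4$, $\sigma = 0.5$, three initial votes for one) and assumes $\kappa_{n,M} = Q\big((M/2 - n)/(\sigma\sqrt{M-1})\big)$; the function names are hypothetical and the numbers it produces are not taken from the report's figures.

```python
import math


def q_function(x):
    """Gaussian tail probability Q(x)."""
    return 0.5 * math.erfc(x / math.sqrt(2.0))


def transition_matrix(m_agents, sigma):
    """Binomial transition matrix P[n][m] of the fully-connected case."""
    rows = []
    for n in range(m_agents + 1):
        k = q_function((0.5 * m_agents - n) / (sigma * math.sqrt(m_agents - 1)))
        rows.append([math.comb(m_agents, m) * k**m * (1.0 - k) ** (m_agents - m)
                     for m in range(m_agents + 1)])
    return rows


def consensus_probabilities(m_agents, sigma, initial_state, steps):
    """Return [Pi(0), Pi(1), ..., Pi(steps)] for a known initial state S(0)."""
    p = transition_matrix(m_agents, sigma)
    size = m_agents + 1
    dist = [0.0] * size
    dist[initial_state] = 1.0
    history = [dist]
    for _ in range(steps):
        # Pi(k+1)[m] = sum_n P[n][m] * Pi(k)[n], i.e. Pi(k+1) = P^T Pi(k).
        dist = [sum(dist[n] * p[n][m] for n in range(size)) for m in range(size)]
        history.append(dist)
    return history


from_three = consensus_probabilities(4, 0.5, initial_state=3, steps=5000)
from_one = consensus_probabilities(4, 0.5, initial_state=1, steps=5000)
early_accuracy = from_three[10][4]   # Prob[all four vote 1] after 10 steps
late_accuracy = from_three[-1][4]    # same probability after 5000 steps
memory_gap = max(abs(a - b) for a, b in zip(from_three[-1], from_one[-1]))
```

After a few steps the probability of accurate consensus is high, while after many steps the distributions started from $S(0) = 3$ and $S(0) = 1$ coincide, in line with Remark 3.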


Figure 2. Consensus dynamics over a fully-connected graph - case of unknown link quality, M = 4, σ = 0.5 (horizontal axis: time step k in log scale).

Figure 3. Impact of link quality on binary consensus - case of fully-connected graph with unknown link
quality, M = 4.


Lemma 2. Let $P$ represent a transition probability matrix generated using $\kappa_{n,M}$, as indicated by
Eq. 16. We will have,

$$\sum_{m=0}^{\lceil M/2 \rceil - 1}\Big(\frac{M}{2} - m\Big)\big(P_{n,m} - P_{M-n,m}\big) = \frac{M}{2}\big(1 - 2\kappa_{n,M}\big). \tag{19}$$

Proof: See our proof in [5].

Using  a  number  of  other  derived  lemmas,  we  can  then  prove  the  following  theorem  regarding 
the  second  largest  eigenvalue: 

Theorem 1. Let $P_{approx}$ represent an approximation of matrix $P$ under the linearization of the $Q$
function around the origin. Let $\lambda_{1,fc,approx}$ represent the second largest eigenvalue of $P_{approx}$.
Then we will have, $\lambda_{1,fc,approx} = 1 - 2\kappa_{0,M} = 1 - 2Q\Big(\frac{M}{2\sigma\sqrt{M-1}}\Big)$.

Proof: See our proof in [5].

To see how well the approximation of Theorem 1 works, Fig. 4 shows the second largest eigenvalue
and its approximation as a function of $\sigma$ and for $M = 4$, $M = 7$ and $M = 16$. As can be seen,
the approximation works well, especially for smaller $M$ and larger $\sigma$. The proposed approximation
can be useful in understanding the behavior of group consensus in terms of the probability
of reaching and staying in consensus.
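The approximation of Theorem 1 can be compared against a numerical estimate for a single case. The sketch below is not the report's method: it estimates the decay rate of the difference between the distributions started from the all-zero and all-one states, which equals the second largest eigenvalue only under the assumption that this antisymmetric mode is the slowest one. It also assumes $\kappa_{n,M} = Q\big((M/2 - n)/(\sigma\sqrt{M-1})\big)$; all function names are hypothetical.

```python
import math


def q_function(x):
    """Gaussian tail probability Q(x)."""
    return 0.5 * math.erfc(x / math.sqrt(2.0))


def transition_matrix(m_agents, sigma):
    """Binomial transition matrix P[n][m] of the fully-connected case."""
    rows = []
    for n in range(m_agents + 1):
        k = q_function((0.5 * m_agents - n) / (sigma * math.sqrt(m_agents - 1)))
        rows.append([math.comb(m_agents, m) * k**m * (1.0 - k) ** (m_agents - m)
                     for m in range(m_agents + 1)])
    return rows


def slowest_decay_rate(p, steps=400):
    """Per-step decay factor of Pi_a(k) - Pi_b(k), where a is the all-zero
    state and b the all-one state. The difference has no component along
    the eigenvalue-one mode, so its decay is set by the remaining modes."""
    size = len(p)
    diff = [0.0] * size
    diff[0], diff[-1] = 1.0, -1.0
    ratio = 0.0
    for _ in range(steps):
        nxt = [sum(diff[n] * p[n][m] for n in range(size)) for m in range(size)]
        norm_old = sum(abs(v) for v in diff)
        norm_new = sum(abs(v) for v in nxt)
        ratio = norm_new / norm_old
        diff = [v / norm_new for v in nxt]  # renormalize to avoid underflow
    return ratio


def theorem1_approximation(m_agents, sigma):
    """1 - 2 Q(M / (2 sigma sqrt(M - 1)))."""
    return 1.0 - 2.0 * q_function(m_agents / (2.0 * sigma * math.sqrt(m_agents - 1)))


numeric = slowest_decay_rate(transition_matrix(4, 0.5))
approx = theorem1_approximation(4, 0.5)
```

For $M = 4$ and $\sigma = 0.5$ both values are close to one, which is consistent with the long consensus period discussed above.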

Figure 4. Approximation of the second largest eigenvalue.

3.2 Soft Information Processing - Case of Known Link Quality

In this section, we consider the case where each node has the knowledge of its link qualities (variance
of the noise). We propose novel information processing techniques for improving the performance of
binary consensus, based on this knowledge. Then, we have the following decision-making function
at each node after linearization:

$$b_i(k+1) = \mathrm{Dec}\Big(\frac{1}{M}\Big[b_i(k) + \sum_{j=1,\,j\ne i}^{M}\big(\alpha\hat{b}_{j,i}(k) + \beta\big)\Big]\Big) = \mathrm{Dec}\Big(\frac{1}{M}\Big[b_i(k) + \alpha\sum_{j=1,\,j\ne i}^{M} b_j(k)\Big] + \gamma + \frac{\alpha}{M}W_i(k)\Big),$$

where the last term, $\frac{\alpha}{M}W_i(k)$, is a zero-mean Gaussian noise with variance $\sigma_s^2$,

$$\alpha = \frac{1}{1 + 4\sigma^2}, \quad \beta = \frac{1 - \alpha}{2}, \quad \sigma_s^2 = \frac{(M-1)\alpha^2\sigma^2}{M^2} \quad \text{and} \quad \gamma = \frac{M-1}{2M}(1 - \alpha).$$

We can extend the framework of the previous section and define the following variables:

$$\kappa_{n,M|1} = \mathrm{Prob}[b_i(k+1) = 1 \mid b_i(k) = 1 \text{ and } S(k) = n] = Q\Big(\frac{0.5 - \gamma - \frac{1 + \alpha(n-1)}{M}}{\sigma_s}\Big), \quad 1 \le i \le M, \tag{20}$$

$$\kappa_{n,M|0} = \mathrm{Prob}[b_i(k+1) = 1 \mid b_i(k) = 0 \text{ and } S(k) = n] = Q\Big(\frac{0.5 - \gamma - \frac{\alpha n}{M}}{\sigma_s}\Big), \quad 1 \le i \le M, \tag{21}$$

and

$$P_{soft,n,m} = \mathrm{Prob}[S(k+1) = m \mid S(k) = n] = \sum_{x = x_{min}}^{x_{max}} f(x, n, \kappa_{n,M|1})\, f(m - x, M - n, \kappa_{n,M|0}) \tag{22}$$

$$= \sum_{x = x_{min}}^{x_{max}} \binom{n}{x}\kappa_{n,M|1}^{x}(1 - \kappa_{n,M|1})^{n-x}\binom{M-n}{m-x}\kappa_{n,M|0}^{m-x}(1 - \kappa_{n,M|0})^{M-n-m+x}, \tag{23}$$

where $x_{min} = \max(0, m + n - M)$, $x_{max} = \min(n, m)$ and $f(x, n, q) = \binom{n}{x}q^x(1 - q)^{n-x}$ is the probability mass function
of a binomial distribution with $q$ as the success probability. The rate of convergence of the binary
consensus with soft information processing is characterized by calculating the second eigenvalue of
the transition matrix.

Lemma 3. $\kappa_{n,M|1} = 1 - \kappa_{M-n,M|0}$.

Proof:  see  our  proof  in  [11]. 

Lemma 4. Matrix $P_{soft}$ is a centrosymmetric matrix: $P_{soft,M-n,M-m} = P_{soft,n,m}$ for $0 \le n, m \le M$.

Proof: see our proof in [11].
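Lemmas 3 and 4 can be checked numerically for one parameter choice. The sketch below rests on assumptions: it takes the weight to be the best affine estimator value $\alpha = 1/(1 + 4\sigma^2)$ for equiprobable votes, with $\gamma = (M-1)(1-\alpha)/(2M)$ and $\sigma_s = \alpha\sigma\sqrt{M-1}/M$, and it evaluates the conditional probabilities as $Q$ functions of the noise-free part of the update. The function names are hypothetical.

```python
import math


def q_function(x):
    """Gaussian tail probability Q(x)."""
    return 0.5 * math.erfc(x / math.sqrt(2.0))


def soft_parameters(m_agents, sigma):
    """alpha, gamma and sigma_s as assumed in the lead-in."""
    alpha = 1.0 / (1.0 + 4.0 * sigma**2)
    gamma = (m_agents - 1) * (1.0 - alpha) / (2.0 * m_agents)
    sigma_s = alpha * sigma * math.sqrt(m_agents - 1) / m_agents
    return alpha, gamma, sigma_s


def kappa_given_one(n, m_agents, sigma):
    """Prob[agent votes 1 next | it votes 1 now and S(k) = n]."""
    alpha, gamma, sigma_s = soft_parameters(m_agents, sigma)
    return q_function((0.5 - gamma - (1.0 + alpha * (n - 1)) / m_agents) / sigma_s)


def kappa_given_zero(n, m_agents, sigma):
    """Prob[agent votes 1 next | it votes 0 now and S(k) = n]."""
    alpha, gamma, sigma_s = soft_parameters(m_agents, sigma)
    return q_function((0.5 - gamma - alpha * n / m_agents) / sigma_s)


def binomial_pmf(x, n, q):
    return math.comb(n, x) * q**x * (1.0 - q) ** (n - x)


def soft_transition_matrix(m_agents, sigma):
    """P_soft[n][m]: x of the n current 1-voters stay at 1 and m - x of
    the M - n current 0-voters switch to 1."""
    rows = []
    for n in range(m_agents + 1):
        k1 = kappa_given_one(n, m_agents, sigma)
        k0 = kappa_given_zero(n, m_agents, sigma)
        row = []
        for m in range(m_agents + 1):
            lo, hi = max(0, m + n - m_agents), min(n, m)
            row.append(sum(binomial_pmf(x, n, k1) * binomial_pmf(m - x, m_agents - n, k0)
                           for x in range(lo, hi + 1)))
        rows.append(row)
    return rows


M, SIGMA = 4, 1.0
P_soft = soft_transition_matrix(M, SIGMA)
lemma3_gap = max(abs(kappa_given_one(n, M, SIGMA) - (1.0 - kappa_given_zero(M - n, M, SIGMA)))
                 for n in range(1, M + 1))
lemma4_gap = max(abs(P_soft[n][m] - P_soft[M - n][M - m])
                 for n in range(M + 1) for m in range(M + 1))
```

Both gaps are at the level of floating-point rounding for $M = 4$ and $\sigma = 1$, and each row of `P_soft` sums to one.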

Theorem 2. Let $P_{soft,approx}$ represent the transition probability matrix generated under the linearization
of the $Q$ function around the origin. Let $\lambda^{soft}_{1,fc,approx}$ represent the second largest eigenvalue
of $P_{soft,approx}$. Then we have, $\lambda^{soft}_{1,fc,approx} = 1 - 2\kappa_{0,M|0} = 1 - 2Q\Big(\frac{0.5 - \gamma}{\sigma_s}\Big)$.

Proof: see our proof in [11].

To see how well the approximation of Theorem 2 works, Fig. 5 shows the second largest eigenvalue
and its approximation as a function of $\sigma$ and for $M = 4$, $M = 10$ and $M = 20$. As can be
seen, the approximation works well, especially for smaller $M$, larger $\sigma$, or $\alpha$ close to zero. For comparison,
the second largest eigenvalue was approximated as $\lambda_{1,fc,approx} = 1 - 2Q\Big(\frac{M}{2\sigma\sqrt{M-1}}\Big)$ in Theorem
1 for the case where no channel knowledge was available.


Figure 5. Approximation of the second largest eigenvalue for basic soft case.

3.2.1 Soft Information Processing with Learning

In the true structure of soft information processing, the $i$th node needs to have the knowledge of
$\mathrm{Prob}[b_j(k) = 1]$ for $j \ne i$. Such information is not readily available in the receiver. However, it can
be statistically estimated. Let $p_{j,i}(k)$ represent the $i$th node's estimate of $\mathrm{Prob}[b_j(k) = 1]$. Then
we will have the following form of decision-making:


bi(k  +  1)  =  Dec 


bi(k) + y: 


+  (1  “  Pj,i(k))  e 


(24) 


Fig. 6 shows the performance of the proposed soft information processing approaches for a
network of 4 nodes with $\sigma = 1$. $\sigma = 1$ corresponds to a Signal to Noise Ratio of -3dB per link,
which is extremely low. For comparison, the performance for the case where no channel knowledge
is available is also plotted. The basic soft information processing (with no learning) can improve
the performance of reaching consensus over the network. However, its asymptotic behavior is still
undesirable, as the probability of accurate consensus starts to decrease after a while. It can be seen
that the proposed soft strategies can increase the probability of accurate consensus considerably.
By estimating the probability distribution of the votes of other nodes, soft information processing
with learning can improve the performance considerably and have a desirable asymptotic behavior (the system
preserves the memory of the initial state) as well as a superior transient behavior.
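To make the role of an estimated vote probability concrete, the following is a generic sketch and not the report's Eq. (24): it computes the standard Bayesian posterior probability that a neighbor voted 1, given a reception over an AWGN link with known noise standard deviation and a prior estimate of $\mathrm{Prob}[b_j(k) = 1]$. The function name and arguments are hypothetical.

```python
import math


def posterior_vote_one(received, prior_one, sigma):
    """Posterior Prob[b_j = 1 | received value], where received = b_j + w
    with w ~ N(0, sigma^2) and prior_one is the receiver's current estimate
    of Prob[b_j = 1]."""
    like_one = math.exp(-((received - 1.0) ** 2) / (2.0 * sigma**2))
    like_zero = math.exp(-(received**2) / (2.0 * sigma**2))
    numerator = prior_one * like_one
    return numerator / (numerator + (1.0 - prior_one) * like_zero)


# A reception halfway between 0 and 1 carries no information, so the
# posterior equals the prior; a learned prior then decides the outcome.
uninformative = posterior_vote_one(0.5, 0.9, 1.0)
leaning_one = posterior_vote_one(1.2, 0.5, 1.0)
```

The example shows why a learned prior can matter at low signal-to-noise ratio: when the reception itself is ambiguous, the estimate of the neighbor's voting pattern dominates the soft value.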


Figure 6. Performance of the proposed soft-information processing approaches for M = 4 and σ = 1.


3.3  Impact  of  Graph  Connectivity 

In this part we discuss our results on the impact of graphs that are not fully connected on the
consensus behavior. More specifically, we extend our framework to derive an expression for the
second largest eigenvalue assuming a time-invariant connected (undirected) graph.

3.3.1 Case of Unknown Link Quality

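If the graph is not fully connected, we will have

$$b_i(k+1) = \mathrm{Dec}\Big(\frac{1}{N_i}\sum_{j \in \Psi_i}\hat{b}_{j,i}(k)\Big) = \mathrm{Dec}\Big(\frac{1}{N_i}\sum_{j \in \Psi_i} b_j(k) + \frac{1}{N_i}\sum_{j \in \Psi_i,\,j \ne i} w_{j,i}(k)\Big), \tag{25}$$

where $\Psi_i$ represents the set of those agents that can communicate to the $i$th one (including itself)
and $N_i$ represents the size of $\Psi_i$.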
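The neighborhood-based update of Eq. (25) can be sketched on a small ring. This is an illustration under assumptions: each node averages only the votes it receives from its neighbor set (which includes itself) and thresholds at 0.5, the ring topology is a hypothetical example, and the function names are not from the report.

```python
import random


def dec(x):
    """Binary 0-1 decision: 1 if x >= 0.5, else 0."""
    return 1 if x >= 0.5 else 0


def neighborhood_step(votes, neighbors, sigma, rng):
    """One synchronous update where node i averages the noisy votes of the
    nodes in neighbors[i] (its own vote is received without noise)."""
    new_votes = []
    for i, neighbor_set in enumerate(neighbors):
        total = 0.0
        for j in neighbor_set:
            noise = 0.0 if j == i else rng.gauss(0.0, sigma)
            total += votes[j] + noise
        new_votes.append(dec(total / len(neighbor_set)))
    return new_votes


# Ring of five nodes: each node hears itself and its two adjacent nodes.
ring = [[(i - 1) % 5, i, (i + 1) % 5] for i in range(5)]
rng = random.Random(1)
step0 = [1, 1, 0, 1, 0]
step1 = neighborhood_step(step0, ring, sigma=1e-9, rng=rng)
step2 = neighborhood_step(step1, ring, sigma=1e-9, rng=rng)
```

With nearly noise-free links this ring needs two exchanges to agree, whereas a fully-connected graph with the same initial votes would agree after one, which illustrates how connectivity affects consensus speed.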


Theorem 3. Let $T_{approx}$ represent the higher-order dimensional transition probability matrix generated
under the linearization of the $Q$ function around the origin. Let $N_i = N$ for $1 \le i \le M$. Then,
$\lambda_{1,nfc,approx} = 1 - 2\kappa_{0,N}$ is the second largest eigenvalue of $T_{approx}$, where $\kappa_{0,N} = Q\Big(\frac{N}{2\sigma\sqrt{N-1}}\Big)$ and
subscript “nfc” denotes the eigenvalues for the case of not fully-connected graphs.


Proof:  see  our  proof  in  [11,12]. 

3.3.2  Case  of  Known  Link  Quality 

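The analysis of basic soft-information processing can be extended to not fully-connected graphs as
follows,

$$b_i(k+1) = \mathrm{Dec}\Big(\frac{1}{N_i}\Big[b_i(k) + \alpha\sum_{j \in \Psi_i,\,j \ne i} b_j(k)\Big] + \gamma_i + \frac{\alpha}{N_i}W_i(k)\Big), \tag{26}$$

where $W_i(k) = \sum_{j \in \Psi_i,\,j \ne i} w_{j,i}(k)$ and $\gamma_i = \frac{N_i - 1}{2N_i}(1 - \alpha)$.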


Theorem 4. Let $T_{soft,approx}$ represent the higher-order dimensional transition probability matrix
generated under the linearization of the $Q$ function around the origin. Let $N_i = N$ for $1 \le i \le M$.
Then, $\lambda^{soft}_{1,nfc,approx} = 1 - 2\kappa_{0,N|0}$ is the second largest eigenvalue of $T_{soft,approx}$, where $\kappa_{0,N|0} = Q\Big(\frac{0.5 - \gamma_i}{\sigma_s}\Big)$ for any $1 \le i \le M$, where $\sigma_s^2 = \frac{(N-1)\alpha^2\sigma^2}{N^2}$.

Proof: see our proof in [6].


4.  Active  Data  Management 
Executive  Summary 

This portion of the research considered computation and storage in distributed, unreliable sensor
networks subject to significant, potentially correlated failures induced by WMD stressors. To
enable such systems, this research focused on lightweight monitoring systems that can drive
decentralized adaptation. Primarily, we considered making optimistic adaptation decisions using
previously-developed systems for lightweight gossiping of performance data on existing application
messages [13, 14]. As part of this research, we developed a set of metrics to quantify the suitability
of an application to gossip-based monitoring and adaptation, developed a variant of the two-phase
commit protocol needed for adaptation using gossiped performance data, and studied how effective
gossip-based performance data was in driving load-balancing decisions in distributed applications.
In addition, we examined system software extensions to allow us to quantify the impact of failures,
including the correlated failures caused by WMD stressors, on network processing. Preliminary
results from this research demonstrate the ability to control fault injection in sensor systems both in
simulation and real hardware, providing a viable testbed for performing basic networking research
on WMD-induced correlated failures in sensor systems.

• Graduate Students: F. Orlando Arbildo, Ricardo Villalon

• Postdoctoral Fellows: Patrick M. Widener, Wenbin Zhu

• Faculty: Patrick Bridges (PI), UNM; Patricia Crowley (Co-PI), Gonzaga University

toring and timing with embedded gossip.” IEEE Transactions on Parallel and Distributed
Systems (TPDS), 20(7):1038-1049, July 2009.

• F. Orlando Arbildo and Basak Gocmen, “A Compact Testbed for Wireless Sensor Networks”,

Technical Summary

Quantifying Suitability to Gossip-based Monitoring: Driving adaptation in large-scale failure-prone
distributed systems such as battlefield sensor network systems requires appropriate monitoring
information. Recent research [14] has shown that such monitoring can be done scalably if performance
data is gossiped on existing application messages, at the cost of sacrificing the global consistency
of performance data. In this portion of the research, we sought to quantify the usefulness of this
gossip-based approach to various optimizations by developing an appropriate set of metrics.

We developed five different metrics for evaluating the suitability of a given application to gossip-based
distributed monitoring and adaptation techniques. These metrics include three time-interval
metrics that measure the largest amount of time it could take to perform a portion of gossip-based
communication in an application, summarized in Figure 7, as well as two others. Specifically, the
five metrics we have chosen are:

•  Wait  time,  the  amount  of  time  between  taking  a  measurement  and  the  first  remote  node 
receiving  complete  monitoring  information. 

•  Propagation  time,  the  amount  of  time  between  the  first  and  the  last  remote  nodes  receiving 
complete  monitoring  information. 

• Resolution time, the amount of time between taking a measurement and the last remote
node receiving complete monitoring information (i.e., the sum of wait time and propagation
time).

Figure 7. Time-interval metrics: wait time, propagation time, and resolution time are defined in terms of
when the first and last remote nodes receive data measured at time t = 0.


•  Effectiveness,  the  number  of  complete  measurements  that  can  be  done  by  Embedded  Gossip 
during  an  application’s  execution. 

•  Monitoring  overhead,  the  cost  of  doing  embedded  gossiping  in  an  application. 
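The three time-interval metrics in the list above can be computed directly from the instants at which remote nodes receive complete monitoring information. A minimal sketch, in which the function name and the flat list of arrival times are assumptions made for illustration:

```python
def time_interval_metrics(measured_at, arrival_times):
    """Return (wait, propagation, resolution) for one measurement.

    measured_at: instant at which the measurement was taken.
    arrival_times: instants at which each remote node received complete
    monitoring information for that measurement.
    """
    first = min(arrival_times)
    last = max(arrival_times)
    wait = first - measured_at          # measurement -> first remote node
    propagation = last - first          # first remote node -> last one
    resolution = last - measured_at     # measurement -> last remote node
    return wait, propagation, resolution


wait, propagation, resolution = time_interval_metrics(0.0, [3.0, 5.0, 9.0])
```

By construction the resolution time equals the sum of the wait time and the propagation time.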

Figure 8(a) shows the effectiveness results of 9 benchmarks and one application. We can see
that some benchmarks, BT, CG, LU, SP, SMG, and CCELL, can do more than 500 one-to-all
notifications during their execution time. Based on this, the types of applications represented by
each of these benchmarks, all common high-performance application types, appear to be amenable
to gossip-based monitoring. In contrast, Embedded Gossip performs poorly for EP and IS because
of the embarrassingly parallel nature of these applications.

Figure 8(b) shows the time-interval metrics for four benchmarks and one application that all
have high effectiveness metrics: CG, LU, MG, SMG2000 and ChemCell. Despite having roughly
similar effectiveness, the breakdown between wait time and propagation time in these applications
varies dramatically. CG, for example, spends the majority of its resolution time waiting for a
measurement to begin to propagate, and then spreads the resulting data quickly throughout the
application, resulting in a relatively small amount of time during which different processes' global
views are inconsistent. Programs like LU or MG, on the other hand, gradually propagate results
throughout the application, resulting in relatively longer propagation times than wait times. Because
of this, gossip-based approaches are unlikely to be appropriate for driving global adaptation
in programs structured like LU or MG despite being an effective measurement tool in these
applications. Programs like SMG2000 and ChemCell represent a middle ground between these two
types of application, with resolution time split relatively equally between wait and propagation time.

Figure 8. Effectiveness and Time Interval Metric Measurements for NAS Benchmarks and ChemCell
Synthetic Application: (a) NAS Benchmark Effectiveness, (b) Time interval analysis.

Optimistic Adaptation using Gossiped Information: To study the suitability of using gossip-based
monitoring to drive adaptation, we examined using gossiping to drive load-balancing decisions
in a distributed application. In this system, monitoring is done by having each node keep track of
its own workload and by gossiping the system-wide maximum and minimum of these local workloads.
By comparing the gossiped maximum workload with the local workload using the ratio
maximum / local, a process can determine if another process is more highly loaded than the local
process. Similarly, by comparing the gossiped minimum workload with the local workload using
the ratio local / minimum, a process can determine whether this process is more highly loaded
than other processes. If either of these ratios is high, the process can decide that load imbalance
exists, and a load balancing action is needed.
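The ratio test described above can be sketched as follows. The threshold value and the function name are assumptions made for illustration; the report only states that a "high" ratio signals imbalance.

```python
def imbalance_detected(local, gossiped_max, gossiped_min, threshold=1.5):
    """Decide locally whether a load balancing action appears to be needed.

    local: this process's own workload.
    gossiped_max, gossiped_min: system-wide extremes as currently known
    to this process through gossip (possibly stale).
    threshold: hypothetical cut-off above which a ratio counts as high.
    """
    if local <= 0 or gossiped_min <= 0:
        raise ValueError("workloads must be positive")
    others_heavier = gossiped_max / local > threshold   # maximum / local
    self_heavier = local / gossiped_min > threshold     # local / minimum
    return others_heavier or self_heavier


# Two processes with the same local load but different gossiped views
# can reach different decisions.
view_a = imbalance_detected(local=10.0, gossiped_max=30.0, gossiped_min=9.0)
view_b = imbalance_detected(local=10.0, gossiped_max=11.0, gossiped_min=9.0)
```

In the example, the first process sees imbalance while the second does not, which is the kind of disagreement between local checks that a commit protocol has to resolve.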

Because of the decentralized nature of Embedded Gossip, however, the workload estimates
gathered in this system may be inconsistent. For example, one process may have noticed imbalance
and decided that a load balancing action is needed, while others may have seen no imbalance
at all. If not dealt with properly, localized load checks may return different results at different
processes, which can cause deadlock where some processes wait indefinitely for a global synchronous
load balancing action, and others continue with their computation.

We address this problem by developing a new simplified variant of the two-phase commit (2PC)
algorithm [4], termed 3-wait and shown in Figure 9. The 3-wait algorithm, like the standard 2PC
protocol, uses a coordinator process and ensures that either all processes commit to perform an
action or none do. Unlike 2PC, however, 3-wait assumes that vote requests happen
implicitly through the gossiping of performance information. This prevents applications that do
not need to perform any commits (for example, load balancing applications running a data set that
never goes out of balance) from paying unnecessary vote request and commit costs.
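The commit-or-abort rule can be illustrated with a small simulation. This is a simplified sketch, not the 3-wait implementation itself: it assumes reliable message delivery, models a timeout as a missing acknowledgement or a missing answer (`None`), and assumes that a process whose gossiped view shows no imbalance simply sends no acknowledgement. The function names are hypothetical.

```python
from enum import Enum


class Outcome(Enum):
    COMMIT = "commit"
    ABORT = "abort"


def coordinator_decision(participants, acks_received):
    """Commit only if every participant acked before the timeout."""
    if set(participants) <= set(acks_received):
        return Outcome.COMMIT
    return Outcome.ABORT


def participant_outcome(answer):
    """Abort on a cancel answer or on a timeout (modeled as None)."""
    if answer is None or answer is Outcome.ABORT:
        return Outcome.ABORT
    return Outcome.COMMIT


def run_round(participants, agrees):
    """Simulate one round, assuming no messages are lost.

    agrees[p] says whether p's gossiped view also shows imbalance;
    a process that disagrees sends no ack (an assumption of this sketch),
    which the coordinator observes as a timeout.
    """
    acks = [p for p in participants if agrees[p]]
    answer = coordinator_decision(participants, acks)
    return {p: participant_outcome(answer) for p in participants}
```

In this model a single process with an inconsistent view is enough to abort the round for everyone, so inconsistent load measurements translate directly into missed load balancing actions.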

[Figure 9 diagram: COORD and OTHER timelines. COORD: send first message; wait ack from all with
timeout (timeout causes abort). OTHER: wait first reply with timeout (timeout causes abort);
if answer is cancel, abort; if answer is commit, commit.]

Figure 9. Time flow diagram for the 3-wait algorithm

To investigate the performance of this approach across a range of scenarios, we conducted
additional experiments using the imb-please benchmark over a wide range of synthetic load
balancing parameters. In particular, we compared speedups between the gossip-based and
traditional scheduling of load balancing actions. As shown in Figure 10, gossip-based scheduling of load
balancing can effectively detect and initiate load balancing actions, resulting in substantial speedups,
but the gossip-based system is not as effective as conventional load balancing at this scale. When
global communication is frequent (comm scale 1), the gossip-based and conventional approaches
are comparable, but the speedup in the gossip-based approach degrades as the frequency of global
communications decreases. This degradation is due to increased inconsistency in measured load
imbalance between processes, resulting in aborts in the 3-wait algorithm and missed load balancing
opportunities.

Testbeds for Conducting Basic Sensor Network Research: Conducting leading-edge network
protocol and data transformation research on sensor networks requires an adequate system for
evaluating results; unfortunately, existing simulation and hardware systems are frequently
inadequate for this purpose. To support our other research on this project, we evaluated a number of
different sensor network platforms, including simulation systems such as TOSSIM and GloMoSim,
and have developed a hardware testbed system for conducting additional network protocol
research on sensor networks. This system provides basic functionality for allocating and programming
sensors, and we have recently enhanced it with new system software extensions that let
us control failures in both simulation-based and hardware-based sensor networks, enabling
future research on network protocols for addressing correlated failures in sensor networks.
Appendix A. Algorithm

Algorithm 1 Algorithm to devise task reallocation policies for multi-server DCSs

Require: K, m_i and [...], with j = 1, ..., n, i ≠ j
Ensure: [...]
Set U_i = {j : [...] > 0}, [...] = 0 and k = 1
loop
    while j ∈ U_i do
        U_i ← U_i \ {j}
        m_1 = m_i − Σ_{ℓ ∈ U_i} L_{iℓ} and m_2 = m_j
        Solve (3) using m_1 and m_2 to obtain [...]
        U_i ← U_i ∪ {j}
    end while
    Set U_i = {j : [...] > 0}, [...] and k ← k + 1
    if Σ_{j=1}^{n} [...] = 0 or k > K then
        L_{ij} = [...] for all j ∈ U_i and exit
    end if
end loop